Cannabis sativa SimpleTidy_GeneCoEx

Author

Karen Cristine Goncalves

Published

July 4, 2025

Initial data overview

We found a total of 130 RNAseq datasets for Cannabis sativa available in NCBI, belonging to 6 different projects, 6 different tissues/organs, and sequenced from both single- and paired-end libraries (with variable read sizes, from 35 to 150).

BioProject Number of Samples
PRJNA1108719 3
PRJNA1126191 20
PRJNA1173925 12
PRJNA560453 64
PRJNA73819 8
PRJNA80055 23
Tissue Number of Samples
Flower 50
Leaf 19
Meristem 20
Root 10
Stem 7
Trichome 24
Sequencing layout Number of Samples
PE 104
SE 26

Read trimming and filtering (fastp) was repeated to include the shorter-read samples:

  • for samples with original average read length < 40 bases, the minimum size allowed after trimming was 30 bases
  • for samples with original average read length > 40 bases, the minimum size allowed after trimming was 50 bases.

Read mapping, counting and TPM computation

Reads were aligned to C. sativa’s female genome (downloaded from Ensembl Plants release 61: genome and GTF annotation) using STAR.

Given the difference in average read lengths between the datasets, the genome indexing step was taking using four different values for the --sjdbOverhang parameter: 30, 70, 90 and 144.

The option --quantMode was added to STAR to obtain read counts. Strandness was infered using RSeQC infer_experiment.py script, alignment BAM files and the genome annotation in BED format.

Subsequently, read counts (for each replicate, according to their strandness) were combined into a count matrix using R 4.4.0 and the tidyverse package.

For TPM computation, the GTF file was used to compute gene lengths (through exon overlap) (using R packages rtracklayer and GenomicRanges).

PCA and gene filtering

Before computing a minimum expression threshold, a principal component analysis was performed to detect potentially aberrant samples (eg. samples of the same bioproject clustering together despite originating from different tissues). For this, genes that were not detect in any sample were removed (overall read count ); then PCA was computed with (log10(TPM+1)).

Since no clear issues were visible, the analysis proceeded with all datasets.

Filtering

After filtering weakly expressed genes, 10617 were removed from the dataset and 16632 were kept (61.04% of all genes).

PCA post gene filtering

No obvious difference was observed in the PCA before and after gene filtering, indicating that most of the genes removed from the dataset had TPM = 0 in all samples (as those genes were removed prior to the initial PCA).

Gene expression variation

To detect clusters of genes co-expressed in different tissues, the genes with most variable expression in relation to the tissues were selected.

For this, TPM levels for each gene were averaged for each tissue, then the relative variance (\(variance/mean\)) of the averaged TPM values was computed. Next, the top 20% genes with highest relative variance of TPM were selected.

Table 1: Possible genes in the pathway of interest. Column ‘In network’ indicates highly variable genes selected for network construction.

Of the 35 bait genes, 15 were included in the network construction.

Gene correlation

With the most variable genes selected, TPM levels were transformed into z-scores (to make the expression levels of different genes comparable), then Pearson correlation score was computed. The higher the correlation score, the more similar the expression of two genes across the different tissues in the dataset.

To filter the correlation table, correlation scores were first binned by rounding to 5 significant digits; for each bin, the total number of edges (TE) and the number of significant edges (SE) found with r scores equal or greater than the current bin were counted, then SE was divided by TE to obtain the cumulative proportion of significant correlations. Finally, the correlation threshold was selected as the lowest positive r score with cumulative proportion of significant correlations > 0.9.

Note
  • TE = total number of edges
  • SE = number of significant edges
  • CPSC = cumulative proportion of significant correlations
  • rbin = binned r score values
  • r = r score

\[CPSC = \frac{ SE~r \geq rbin~} {TE~r \geq rbin~ }\]

r threshold = lowest positive rbin value with CPSC > 0.9

Correlation threshold = 0.9868

Network construction

The network requires genes and their correlations (a correlation threshold) and a resolution level.

Resolution

In a network, a cluster is a group of genes correlated to each other. When few genes are correlated to each other in a cluster, a low resolution may allow the distinction of the individual genes or of smaller clusters. Clusters formed by strongly correlated genes (higher correlation scores) require higher resolution to distinguish individual genes or of smaller clusters.

To optimize the resolution parameter, the function cluster_leiden was repeated using multiple resolution parameters (from 0.25 to 5, with steps of 0.25). Then, clusters with fewer than 5 genes were discarded and the number of genes in each detected cluster and the number of detected clusters was computed. For this step, the resolution selected was the highest one which kept all genes used in the network construction.

20 30 40 1000 1500 2000 2500 1 2 3 4 5 Resolution Modules Genes in modules
Figure 5

With resolution set to 0.5 , the genes of interest were found in clusters 4, 5, 6, 8, 9, 34 and 37 (table 2).

3 2 5 4 9 8 6 12 20 21 34 30 35 37 42 49 Trichome Flower Meristem Leaf Stem Root Module z-score 0 > 1.5 r threshold = 0.9868 resolution = 0.5 514 437 426 379 313 267 157 111 12 11 10 7 6 5 5 5 Genes in module
Figure 6: Average expression (z-score of TPM levels) of genes in each cluster and each tissue.
Table 2